PhD

The LaTeX sources of my Ph.D. thesis
git clone https://esimon.eu/repos/PhD.git

related work.tex (40403B)


      1 \section{Related Work}
      2 \label{sec:graph:related work}
       3 In the previous section, we showed that the attributed multigraph encoding we introduced in Section~\ref{sec:graph:encoding} can help us leverage additional information for the relation extraction task.
       4 In this section, we present existing frameworks for computing distributed representations of graphs.
       5 In most cases, these methods process simple undirected graphs \(G=(V, E)\).
      6 Still, these methods are applicable to our relation extraction multigraph with some modifications, as shown in Sections~\ref{sec:graph:r-gcn} and~\ref{sec:graph:approach}.
      7 
       8 The use of graphs in deep learning has seen a surge of interest over the last few years.
      9 This produced a set of models known as graph neural networks (\textsc{gnn}) and graph convolutional networks (\textsc{gcn}).%
     10 \sidenote{
     11 	The term \textsc{gcn} is used with different meanings by various authors.
     12 	\textsc{gcn}s are always \textsc{gnn}s, but the reverse is not true.
      13 	However, in practice, the \textsc{gnn}s we present in this section can essentially be described as \textsc{gcn}s.
     14 	We use the term \textsc{gcn} to describe models whose purpose is to have a similar function on graphs as \textsc{cnn}s have on images.
     15 	Some authors only refer to the model of \textcite{gcn_spectral_semi} described in Section~\ref{sec:graph:spectral gcn} as a \textsc{gcn}.
     16 	In this case, what we call \textsc{gcn} can be called conv\textsc{gnn} (convolutional graph neural networks).
     17 	In any case, \textsc{gnn} and \textsc{gcn} can be considered almost synonymous for the purpose of this thesis since we don't describe any exotic \textsc{gnn} which clearly falls outside of the realm of \textsc{gcn}.
     18 	\label{note:graph:gcn vs gnn}
     19 }
     20 While the first works on \textsc{gnn} started more than twenty years ago \parencite{gnn_early}, we won't go into a detailed historical review, and we exclusively focus on recent models.
     21 Note that we already presented an older graph-based approach in Section~\ref{sec:relation extraction:label propagation}, the label propagation algorithm.
     22 We also discussed \textsc{epgnn} in Section~\ref{sec:relation extraction:epgnn}, which is a model built on top of a \textsc{gcn}.
     23 We further draw parallels between \textsc{epgnn} and our proposed approach in Section~\ref{sec:graph:topological features}.
     24 
     25 The thread of reasoning behind this section is as follows:
     26 \begin{itemize}[nosep]
     27 	\item We present the ``usual'' way to process graphs (Sections~\ref{sec:graph:random walk}--\ref{sec:graph:r-gcn}).
     28 	\item We present the theory behind these methods (Section~\ref{sec:graph:weisfeiler-leman}).
     29 	\item We show how this theoretical background can help us design a new approach specific to the unsupervised relation extraction task (Section~\ref{sec:graph:approach}).
     30 \end{itemize}
     31 In this related work overview, we mainly describe algorithms working on standard \(G=(V, E)\) graphs, not the labeled multigraphs of Section~\ref{sec:graph:encoding}, with the exception of Section~\ref{sec:graph:r-gcn}.
     32 We start by quickly describing models based on random walks in Section~\ref{sec:graph:random walk}; these are spatial methods which serve as a gentle introduction to the manipulation of graphs by neural networks.
      33 Furthermore, they were influential in the development of subsequent models and in our preliminary analysis based on path statistics (Section~\ref{sec:graph:analysis}), which allows us to draw parallels with more modern approaches.
     34 We then introduce the two main classes of \textsc{gcn}---which consequently are also the two main classes of \textsc{gnn}---used nowadays: spectral (Section~\ref{sec:graph:spectral gcn}) and spatial (Section~\ref{sec:graph:spatial gcn}).
      35 Apart from the few works mentioned in Chapter~\ref{chap:relation extraction}, \textsc{gnn}s have seldom been used for relation extraction.
      36 We therefore focus on the evaluation of \textsc{gnn}s on an entity classification task, which, while different from our problem, works on similar data.
     37 In Section~\ref{sec:graph:r-gcn}, we describe models designed to handle relational data in a knowledge base, in particular \textsc{r-gcn}.
     38 We close this related work with a presentation of the Weisfeiler--Leman isomorphism test in Section~\ref{sec:graph:weisfeiler-leman}; it serves as a theoretical motivation behind both \textsc{gcn}s and our proposed approach.
     39 
     40 \subsection{Random-Walk-Based Models}
     41 \label{sec:graph:random walk}
     42 DeepWalk \parencitex{deepwalk} is a method to learn vertex representations from the structure of the graph alone.
     43 The representations encode how likely it is for two vertices to be close to each other in the graph.
     44 To this end, DeepWalk models the likelihood of random walks in the graph (Section~\ref{sec:graph:analysis}).
     45 These walks are simply sequences of vertices.
     46 To obtain a distributed representation out of them, we can use the \textsc{nlp} approaches of Sections~\ref{sec:context:word} and~\ref{sec:context:sentence} by treating the set of vertices as the vocabulary \(V=\entitySet\).
     47 In particular, DeepWalk uses the skip-gram model of Word2vec (Section~\ref{sec:context:skip-gram}), using hierarchical softmax to approximate the partition function over all words---i.e.~vertices.
      48 Vertices appearing in the same random walk are used as positive examples.
     49 In the same way that learning representations to predict the neighborhood of a word gives good word representations, modeling the neighborhood of a vertex gives good vertex representations.
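To make the sampling procedure concrete, the following Python sketch generates uniform random walks from an adjacency list; the walk length and number of walks per vertex are arbitrary placeholders rather than the hyperparameters of \textcite{deepwalk}.
\begin{verbatim}
import random

def random_walks(adjacency, walks_per_vertex=10, walk_length=40):
    """Sample uniform random walks from an adjacency list {vertex: [neighbors]}."""
    walks = []
    vertices = list(adjacency)
    for _ in range(walks_per_vertex):
        random.shuffle(vertices)
        for start in vertices:
            walk = [start]
            while len(walk) < walk_length and adjacency[walk[-1]]:
                walk.append(random.choice(adjacency[walk[-1]]))
            walks.append(walk)
    return walks

# Each walk is then treated as a "sentence" whose "words" are vertices and fed
# to a skip-gram model (DeepWalk uses hierarchical softmax for the partition function).
\end{verbatim}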
     50 
     51 \Textcite{deepwalk} evaluate their model on a node classification task.
     52 For example, one of the datasets they use is BlogCatalog \parencite{blogcatalog}, where vertices correspond to blogs, edges are built from social network connections between the various bloggers, and predicted labels are the set of topics on which each blog focuses.
     53 DeepWalk is a transductive method but was extended into an inductive approach called planetoid \parencitex{planetoid}.
     54 Planetoid also proposes an evaluation on an entity classification task performed on the \textsc{nell} dataset.
     55 The goal of this task is to find the type of an entity (e.g.~person, organization, location\dots) in a knowledge base (Section~\ref{sec:context:knowledge base}).
     56 To this end, a special bipartite%
     57 \sidenote{
     58 	A bipartite graph is a graph \(G=(V, E)\) where the vertices can be split into two disjoint sets \(V_1\cup V_2=V\) such that all edges \(e\in E\) have one endpoint in \(V_1\) and one endpoint in \(V_2\).
     59 }
     60 graph \(G_\textsc{b} = (V_\textsc{b}, E_\textsc{b})\) is constructed where \(V_\textsc{b}=\entitySet\cup\relationSet\) and:
     61 \begin{equation*}
     62 	E_\textsc{b} = \big\{\, \{e, r\}\subseteq V_\textsc{b} \mathrel{\big|} \exists e'\in\entitySet : (e, r, e') \in\kbSet\lor (e', r, e)\in\kbSet \,\big\}
     63 \end{equation*}
      64 This clearly assumes \hypothesis{biclique}: for each relation, the information of which \(e_1\) goes with which \(e_2\) is discarded.
      65 However, this information is not as crucial for entity classification as it is for relation extraction.
     66 A small example of graph \(G_\textsc{b}\) obtained this way is given in Figure~\ref{fig:graph:nell bipartite}.
      67 The model is trained by jointly optimizing the negative sampling loss and the log-likelihood of labeled examples.
      68 On unseen entities, planetoid reaches an accuracy of 61.9\% when only 0.1\% of entities are labeled.
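The construction of \(G_\textsc{b}\) is a direct transcription of the definition above; the short Python sketch below (on a hypothetical toy knowledge base) makes explicit how the pairing between \(e_1\) and \(e_2\) is lost.
\begin{verbatim}
def bipartite_edges(kb):
    """Build the entity--relation bipartite edge set E_B from triples (e1, r, e2)."""
    edges = set()
    for e1, r, e2 in kb:
        edges.add((e1, r))  # e1 fills one slot of relation r
        edges.add((e2, r))  # so does e2: which e1 goes with which e2 is discarded
    return edges

# Hypothetical toy knowledge base: both facts collapse onto the same relation vertex.
toy_kb = {("Paris", "capital of", "France"), ("Berlin", "capital of", "Germany")}
print(sorted(bipartite_edges(toy_kb)))
\end{verbatim}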
     69 
     70 \begin{marginfigure}
     71 	\centering
     72 	\input{mainmatter/graph/nell bipartite.tex}
     73 	\scaption[\textsc{nell} dataset bipartite graph.]{
     74 		\textsc{nell} dataset bipartite graph.
     75 		Entities are on the left, while relation slots are on the right.
     76 		In this graph, the edges are left unlabeled.
     77 		\label{fig:graph:nell bipartite}
     78 	}
     79 \end{marginfigure}
     80 
     81 Using random walks allows DeepWalk and planetoid to leverage the pre-existing \textsc{nlp} literature.
     82 However, for each sample, only a small fraction of the neighborhood---two neighbors at most---of each node is considered to make a prediction.
     83 Subsequent methods focused on modeling the information of the whole neighborhood jointly.
     84 
     85 \subsection{Spectral \textsc{gcn}}
     86 \label{sec:graph:spectral gcn}
     87 The first approaches to successfully model the neighborhood of vertices jointly were based on spectral graph theory \parencite{gcn_spectral_early}.
     88 In practice, this means that the graph is manipulated through its Laplacian matrix instead of directly through the adjacency matrix.
     89 In this section, we base our presentation of spectral methods on the work of \textcitex{gcn_spectral_semi}[-11mm].
     90 
     91 \begin{marginparagraph}[-3mm]
     92 	The graph Laplacian is similar to the standard Laplacian measuring the divergence of the gradient (\(\laplace = \nabla^2\)) of scalar functions.
     93 	Except that the graph gradient is an operator mapping a function on vertices to a function on edges:
     94 	\begin{equation*}
      95 		(\nabla \vctr{f})_{ij} = f_j - f_i
     96 	\end{equation*}
     97 	And that the graph divergence is an operator mapping a function on edges to a function on vertices:
     98 	\begin{equation*}
     99 		(\operatorname{div} \mtrx{G})_i = \sum_{j\in V} m_{ij} g_{ij}
    100 	\end{equation*}
    101 	Given these definitions, the graph Laplacian is defined as \(\laplace = - \operatorname{div} \nabla\).
    102 	Applying \(\laplace\) to a signal \(\vctr{x}\in\symbb{R}^n\) is equivalent to multiplying this signal by \(\laplacian{c}\) as defined in Equation~\ref{eq:graph:laplacian}: \(\laplace \vctr{x} = \laplacian{c} \vctr{x}\).
    103 \end{marginparagraph}
    104 
    105 We start by introducing some basic concepts from spectral graph theory used to define the convolution operator on graphs.
    106 The Laplacian of an undirected graph \(G=(V, E)\) can be defined as:
    107 \begin{equation}
    108 	\laplacian{c} = \mtrx{D} - \mtrx{M},
    109 	\label{eq:graph:laplacian}
    110 \end{equation}
    111 where \(\mtrx{D}\in\symbb{R}^{n\times n}\) is the diagonal matrix of vertex degrees \(d_{ii} = \gfdegree(v_i)\) and \(\mtrx{M}\in\symbb{R}^{n\times n}\) is the adjacency matrix.
    112 Equation~\ref{eq:graph:laplacian} defines the combinatorial Laplacian; however, spectral \textsc{gcn}s are usually defined on the normalized symmetric Laplacian:
    113 \begin{equation*}
    114 	\laplacian{sym} = \mtrx{D}^{-1\fracslash 2} \laplacian{c} \mtrx{D}^{-1\fracslash 2} = \mtrx{I} - \mtrx{D}^{-1\fracslash 2} \mtrx{M} \mtrx{D}^{-1\fracslash 2}.
    115 \end{equation*}
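For instance, on a path graph with three vertices---a toy example of ours---these two Laplacians are:
\begin{equation*}
	\laplacian{c} =
	\begin{bmatrix}
		1 & -1 & 0 \\
		-1 & 2 & -1 \\
		0 & -1 & 1
	\end{bmatrix},
	\qquad
	\laplacian{sym} =
	\begin{bmatrix}
		1 & -\frac{1}{\sqrt{2}} & 0 \\
		-\frac{1}{\sqrt{2}} & 1 & -\frac{1}{\sqrt{2}} \\
		0 & -\frac{1}{\sqrt{2}} & 1
	\end{bmatrix}.
\end{equation*}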
    116 
    117 Using this definition, we can then take the eigendecomposition of the Laplacian \(\laplacian{sym}=\mtrx{U}\mtrx{\Lambda}\mtrx{U}^{-1}\), where \(\mtrx{\Lambda}\) is the ordered spectrum---the diagonal matrix of eigenvalues sorted in increasing order---and \(\mtrx{U}\) is the matrix of normalized eigenvectors.
     118 For an undirected graph, the matrix \(\mtrx{M}\), and therefore \(\laplacian{sym}\), is symmetric; hence \(\mtrx{U}\) can be chosen orthogonal.
     119 The orthonormal basis formed by the normalized eigenvectors defines the Fourier space of the graph.
    120 In other words, we can define the graph Fourier transform of a signal \(\vctr{x}\in\symbb{R}^V\) as:
    121 \begin{marginparagraph}[-1cm]
    122 	The expansion of signals in terms of eigenfunctions of the Laplace operator is the leading parallel between the graph Fourier transform and the classical Fourier transform on \(\symbb{R}\) \parencite{graph_fourier}.
    123 	In \(\symbb{R}\), the eigenfunctions \(\xi\mapsto e^{2\pi i\xi x}\) correspond to low frequencies when \(x\) is small.
    124 	In the same way, the eigenvectors of the graph Laplacian associated with small eigenvalues assign similar values to neighboring vertices.
    125 	In particular the eigenvector associated with the eigenvalue 0 is constant with value \(1\divslash\sqrt{n}\).
    126 	On the other hand, eigenvectors associated with large eigenvalues correspond to high frequencies and encode larger changes of value between neighboring vertices.
    127 \end{marginparagraph}
    128 \begin{equation*}
    129 	\gffourier(\vctr{x}) = \mtrx{U}\transpose \vctr{x}.
    130 \end{equation*}
     131 Furthermore, since \(\mtrx{U}\) is orthogonal, the inverse Fourier transform is simply defined as:
    132 \begin{equation*}
    133 	\gfinvfourier(\vctr{x}) = \mtrx{U} \vctr{x}.
    134 \end{equation*}
    135 Having defined the Fourier transform on graphs, we can use the definition of convolutions as multiplications in the Fourier domain to define convolution on graphs:
    136 \begin{equation}
    137 	\vctr{x} * \vctr{w} = \gfinvfourier(\gffourier(\vctr{x}) \odot \gffourier(\vctr{w})),
    138 	\label{eq:graph:convolution}
    139 \end{equation}
    140 where \(\odot\) denotes the Hadamard (element-wise) product.
    141 Note that the convolution operator implicitly depends on the graph \(G\) since \(\mtrx{U}\) is defined from the adjacency matrix \(\mtrx{M}\).
     142 The signal \(\vctr{w}\) in Equation~\ref{eq:graph:convolution} plays the same role as the parametrized filter of a \textsc{cnn} (Equation~\ref{eq:context:convolution}).
    143 Instead of learning \(\vctr{w}\) in the spatial domain, we can directly parametrize its Fourier transform \(\vctr{w}_\vctr{\theta} = \diagonal(\gffourier(\vctr{w}))\), %
    144 \begin{marginparagraph}
    145 	\(\diagonal(\vctr{x})\) is the diagonal matrix with values of the vector \(\vctr{x}\) along its diagonal.
    146 \end{marginparagraph}
    147 simplifying Equation~\ref{eq:graph:convolution} into:
    148 \begin{equation}
    149 	% ( ͡° ͜ʖ ͡°) UwU
    150 	\vctr{x} * \vctr{w}_\vctr{\theta} = \mtrx{U} \vctr{w}_\vctr{\theta} \mtrx{U}\transpose \vctr{x}.
    151 	\label{eq:graph:filter convolution}
    152 \end{equation}
    153 While \(\vctr{w}_\vctr{\theta}\) could be learned directly \parencite{gcn_spectral_early}, \textcite{chebnet} propose to approximate it by Chebyshev polynomials of the first kind (\(T_k\)) of the spectrum \(\mtrx{\Lambda}\):
    154 \begin{equation}
    155 	\vctr{w}_\vctr{\theta}(\mtrx{\Lambda}) = \sum_{k=0}^{K} \theta_k T_k(\mtrx{\Lambda}).
    156 	\label{eq:graph:filter decomposition}
    157 \end{equation}
    158 The rationale is that computing the eigendecomposition of the graph Laplacian is too computationally expensive.
    159 The Chebyshev polynomials approximation is used to localize the filter; since the \(k\)-th Chebyshev polynomial is of degree \(k\), only values of vertices at a distance of at most \(k\) are needed.%
    160 \sidenote[][-39mm]{
    161 	The reasoning behind this localization is the same as the one underlying the fact that the \(k\)-th power of the adjacency matrix gives the number of walks of length \(k\) (Section~\ref{sec:graph:analysis}).
    162 }
     163 This is similar to how \textsc{cnn}s are usually computed: simple, highly localized filters are used instead of taking the Fourier transform of the whole input to compute convolutions with arbitrarily complex filters.
    164 Chebyshev polynomials of the first kind are defined as:
    165 \begin{marginparagraph}[-32mm]
    166 	Despite its appearance, Equation~\ref{eq:graph:chebyshev cos} defines a series of polynomials which can be obtained through the application of various trigonometric identities.
    167 	An alternative but equivalent definition is through the following recursion:
    168 	\begin{align*}
    169 		T_0(x) & = 1 \\
    170 		T_1(x) & = x \\
    171 		T_{k+1}(x) & = 2x T_k(x) - T_{k-1}(x)
    172 	\end{align*}
    173 	The plot of the first five Chebyshev polynomials of the first kind follows:
    174 	\input{mainmatter/graph/chebyshev.tex}
    175 \end{marginparagraph}
    176 \begin{equation}
    177 	T_k(\cos x) = \cos(k x).
    178 	\label{eq:graph:chebyshev cos}
    179 \end{equation}
    180 They form a sequence of orthogonal polynomials on the interval \([-1, 1]\) with respect to the weight \(1\divslash\sqrt{1-x^2}\), meaning that for \(k\neq k'\):
    181 \begin{equation*}
    182 	\int_{-1}^1 T_k(x) T_{k'}(x) \frac{\diff x}{\sqrt{1-x^2}} = 0.
    183 \end{equation*}
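To illustrate how this avoids the eigendecomposition, the following sketch applies the filter of Equation~\ref{eq:graph:filter decomposition} to a signal using only repeated multiplications by the Laplacian; rescaling the spectrum into \([-1, 1]\) by assuming \(\lambda_\text{max}\approx 2\) for \(\laplacian{sym}\) is a common simplification, not part of Equation~\ref{eq:graph:filter decomposition} itself.
\begin{verbatim}
import numpy as np

def chebyshev_filter(L_sym, x, theta):
    """Apply sum_k theta_k T_k(L_tilde) to the signal x without any eigendecomposition.

    L_tilde = L_sym - I rescales the spectrum into [-1, 1] (assuming lambda_max ~= 2)."""
    L_tilde = L_sym - np.eye(L_sym.shape[0])
    T_prev, T_curr = x, L_tilde @ x              # T_0(L)x = x and T_1(L)x = Lx
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for theta_k in theta[2:]:
        # Chebyshev recursion: T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x)
        T_prev, T_curr = T_curr, 2 * (L_tilde @ T_curr) - T_prev
        out = out + theta_k * T_curr
    return out
\end{verbatim}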
    184 
    185 The filter defined by Equation~\ref{eq:graph:filter decomposition} is \(K\)-localized, meaning that the value of the output signal on a vertex \(v\) is computed from the value of \(\vctr{x}\) on vertices at distance at most \(K\) of \(v\).
    186 This can be seen by plugging Equation~\ref{eq:graph:filter decomposition} back into Equation~\ref{eq:graph:filter convolution}, noticing that it depends on the \(k\)-th power of the Laplacian and thus of the adjacency matrix.%
    187 \sidenotemark% XXX No more space for sidenote on this page
    188 
    189 \leavevmode
    190 \sidenotetext{% XXX No more space for sidenote on previous page
    191 	Derivation of the dependency on \(\laplacian{sym}^k\) for the proof of \(K\)-locality:
    192 	\begin{align*}
    193 		\vctr{x} * \vctr{w}_\vctr{\theta}(\mtrx{\Lambda})
    194 			& = \mtrx{U} \left( \sum_{k=0}^{K} \theta_k T_k(\mtrx{\Lambda}) \right) \mtrx{U}\transpose \vctr{x} \\
    195 			& = \left( \sum_{k=0}^{K} \theta_k \mtrx{U} T_k(\mtrx{\Lambda}) \mtrx{U}\transpose \right) \vctr{x} \\
    196 			& = \left( \sum_{k=0}^{K} \theta_k T_k(\laplacian{sym}) \right) \vctr{x}
    197 	\end{align*}
    198 	For the last equality, notice that \(\laplacian{sym}^k = (\mtrx{U}\mtrx{\Lambda}\mtrx{U}\transpose)^k = \mtrx{U}\mtrx{\Lambda}^k\mtrx{U}\transpose\) since \(\mtrx{U}\) is orthogonal.
    199 	This can also be applied to the (diagonal) constant term.
    200 }%
    201 \Textcite{gcn_spectral_semi} proposed to use \(K=1\) with several further optimizations we won't delve into.
    202 Using \(K=1\) means that their method computes the activation of a node only from its activation and the activations of its neighbors at the previous layer.
    203 This makes the \textsc{gcn} of \textcite{gcn_spectral_semi} quite similar to spatial methods described in Section~\ref{sec:graph:spatial gcn}.
    204 All the equations given thus far were for a single scalar signal; however, we usually work with vector representations for all nodes, \(\mtrx{X}\in\symbb{R}^{n\times d}\).
    205 In this case, the layer \(\ell\) of a \textsc{gcn} can be described as:
    206 \begin{equation*}
     207 	\mtrx{H}^{(\ell+1)} = \ReLU\left((\mtrx{D}+\mtrx{I})^{-1\fracslash2} (\mtrx{M}+\mtrx{I}) (\mtrx{D}+\mtrx{I})^{-1\fracslash2} \mtrx{H}^{(\ell)} \mtrx{\Theta}^{(\ell)}\right),
     208 \end{equation*}
     209 where \(\mtrx{\Theta}^{(\ell)}\in\symbb{R}^{d\times d}\) is the parameter matrix of layer \(\ell\).
    210 Using \(\mtrx{H}^{(0)} = \mtrx{X}\), we can use a \textsc{gcn} with \(L\) layers to combine the embeddings in the \(L\)-localized neighborhood of each vertex into a contextualized representation.
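In matrix form, one such layer takes only a few lines of NumPy; the following is a dense sketch of the propagation rule above, not the sparse implementation of \textcite{gcn_spectral_semi}.
\begin{verbatim}
import numpy as np

def gcn_layer(M, H, Theta):
    """One layer of the GCN of Kipf and Welling (dense sketch of the propagation rule)."""
    M_tilde = M + np.eye(M.shape[0])                        # adjacency with self-loops
    d_inv_sqrt = 1.0 / np.sqrt(M_tilde.sum(axis=1))         # (D + I)^{-1/2}
    A_hat = M_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_hat @ H @ Theta)               # ReLU(A_hat H Theta)
\end{verbatim}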
    211 
    212 \Textcite{gcn_spectral_semi} evaluate their model on the same \textsc{nell} dataset used by planetoid with the same 0.1\% labeling rate.
    213 They train their model by maximizing the log-likelihood of labeled examples.
     214 They obtain an accuracy of 66.0\%, an increase of 4.1 points over planetoid.
    215 
    216 \subsection{Spatial \textsc{gcn}}
    217 \label{sec:graph:spatial gcn}
    218 \begin{marginfigure}
    219 	\centering
    220 	\input{mainmatter/graph/graph convolution parallel.tex}
    221 	\scaption[Parallel between two-dimensional \textsc{cnn} data and \textsc{gcn} data.]{
    222 		Parallel between two-dimensional \textsc{cnn} data and \textsc{gcn} data.
    223 		\label{fig:graph:graph convolution parallel}
    224 	}
    225 \end{marginfigure}
    226 \sidecite{graphsage}% XXX Insert the citation between the figure and the 1-dim CNN sidenote
    227 
    228 Spatial methods directly draw from the comparison with \textsc{cnn} in the spatial domain.
    229 As shown by Figure~\ref{fig:graph:graph convolution parallel}, the lattice on which a 2-dimensional%
    230 \sidenote{
    231 	Even though the same comparison could be made with 1-dimensional \textsc{cnn} as introduced in Section~\ref{sec:context:cnn}, the similarity is less visually striking.
    232 	Especially when considering a filter of width 3, in which case the equivalent graph is a simple path graph: \input{mainmatter/graph/path graph.tex}\kern-1mm.
    233 }
    234 \textsc{cnn} is applied can be seen as a graph with a highly regular connectivity pattern.
    235 In this section, we introduce spatial \textsc{gcn} by following the Graph\textsc{sage} model \parencite{graphsage}.
    236 
     237 When computing the activation of a specific node with a \textsc{cnn}, the filter is centered on this node, and each neighbor is multiplied by the corresponding filter element.
     238 The products are then aggregated by summation.
     239 Spatial \textsc{gcn}s aim to mimic this process.
    240 The main obstacle to generalizing this spatial view of convolutions to graphs is the irregularity of neighborhoods.%
    241 \sidenote{
    242 	Interestingly enough, this is also a problem with standard \textsc{cnn}s when dealing with values at the edges of the matrix.
    243 }
    244 In a graph, nodes have different numbers of neighbors; a fixed-size filter cannot be used.
    245 Graph\textsc{sage} proposes several aggregators to replace this product--sum process:
    246 \begin{description}
     247 	\item[Mean aggregator] The neighbors are averaged and then multiplied by a single filter \(\mtrx{W}^{(\ell)}\):
    248 		\begin{equation*}
    249 			\operatorname{aggregate}_\text{mean}^{(\ell+1)}(v) = \sigmoid\left(\mtrx{W}^{(\ell)} \frac{1}{\gfdegree(v)+1} \sum_{u\in\gfneighbors(v)\cup\{v\}} \hspace{-3mm} \vctr{h}_u^{(\ell)}\right).
    250 		\end{equation*}
    251 		A spatial \textsc{gcn} using this aggregator is close to the \textsc{gcn} of \textcite{gcn_spectral_semi} with \(K=1\) presented in Section~\ref{sec:graph:spectral gcn}.
    252 	\item[L\textmd{\textsc{stm}} aggregator]
    253 		An \textsc{lstm} (Section~\ref{sec:context:lstm}) is run through all neighbors with the final hidden state used as the output of the layer.
    254 		\begin{equation*}
    255 			\operatorname{aggregate}_\textsc{lstm}^{(\ell+1)}(v) = \operatorname{\textsc{lstm}}^{(\ell)}\left(\left(\vctr{h}_u^{(\ell)}\right)_{u\in \gfneighbors(v)}\right)_{\gfdegree(v)}.
    256 		\end{equation*}
    257 		Since \textsc{lstm}s are not permutation-invariant, the order in which the neighbors are presented is important.
    258 	\item[Pooling aggregator]
    259 		A linear layer is applied to all neighbors which are then pooled through a \(\max\) operation.
    260 		\begin{equation*}
    261 			\operatorname{aggregate}_\text{max}^{(\ell+1)}(v) = \max\left(\left\{\,\mtrx{W}^{(\ell)}\vctr{h}_u^{(\ell)} + \vctr{b}^{(\ell)} \middlerel{|} u\in\gfneighbors(v)\,\right\}\right).
    262 		\end{equation*}
    263 		Note that the maximum is applied feature-wise.
    264 \end{description}
     265 Using one of these aggregators, a Graph\textsc{sage} layer performs the following three operations for all vertices \(v\in V\):
     266 \begin{marginparagraph}
     267 	As usual, the matrices \(\mtrx{W}^{(\ell)}_i\) are trainable model parameters.
    268 \end{marginparagraph}
    269 \begin{align*}
    270 	\vctr{a}_v^{(\ell+1)} & \gets \operatorname{aggregate}^{(\ell+1)}(v) \\
    271 	\vctr{h}_v^{(\ell+1)} & \gets \sigmoid\left(\mtrx{W}_1^{(\ell)} \vctr{h}_v^{(\ell)} + \mtrx{W}_2^{(\ell)} \vctr{a}_v^{(\ell+1)}\right) \\
    272 	\vctr{h}_v^{(\ell+1)} & \gets \vctr{h}_v^{(\ell+1)} \divslash \big\|\vctr{h}_v^{(\ell+1)}\big\|_2 .
    273 \end{align*}
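As an illustration, a dense, unsampled sketch of one Graph\textsc{sage} layer with the mean aggregator follows; the dictionary-of-neighbors input format is an arbitrary choice of ours.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graphsage_mean_layer(adjacency, H, W_agg, W1, W2):
    """One GraphSAGE layer with the mean aggregator (no neighborhood sampling)."""
    H_next = np.empty((H.shape[0], W1.shape[0]))
    for v, neighbors in adjacency.items():
        mean = H[neighbors + [v]].mean(axis=0)   # average over v and its neighbors
        a_v = sigmoid(W_agg @ mean)              # aggregate step
        h_v = sigmoid(W1 @ H[v] + W2 @ a_v)      # combine with v's own representation
        H_next[v] = h_v / np.linalg.norm(h_v)    # final L2 normalization
    return H_next
\end{verbatim}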
    274 However, this approach still performs poorly when the graph is irregular.%
    275 \sidenote{
    276 	In graph theory, a \(k\)-regular graph is a graph where all vertices have degree \(k\).
     277 	By irregular, we mean that the distribution of vertex degrees has high variance; we don't use the term in its formal ``highly irregular'' meaning.
     278 	This is indeed the case in scale-free graphs, where the degree variance is infinite when \(\gamma<3\).
    279 }
    280 In particular, high-degree vertices---such as ``United States'' in \textsc{t-re}x as described in Section~\ref{sec:graph:analysis}---incur significant memory usage.
    281 To solve this, Graph\textsc{sage} proposes to sample a fixed-size neighborhood for each vertex during training.
     282 The representation of each vertex is therefore computed from a small number of neighbors.
     283 Since \(L\) layers of Graph\textsc{sage} produce \(L\)-localized representations, vertices need to be sampled at distance at most \(L\) from the vertex for which we want to generate a representation.
    284 \Textcite{graphsage} propose an unsupervised negative sampling loss to train their \textsc{gcn} such that adjacent vertices have similar representations:
    285 \begin{equation}
     286 	\loss{gs} = \sum_{(u, v)\in E}
     287 	- \log \sigmoid\left(\vctr{z}_v\transpose \vctr{z}_u\right)
     288 	- \gamma \expectation_{v'\sim\uniformDistribution(V)}
     289 		\left[ \log \sigmoid\left(-\vctr{z}_{v'}\transpose \vctr{z}_u\right) \right]
    290 	\label{eq:graph:graphsage loss}
    291 \end{equation}
    292 where \(\mtrx{Z} = \mtrx{H}^{(L)}\) is the activation of the last layer and \(\gamma\) is the number of negative samples.
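A Monte Carlo estimate of this loss can be sketched as follows, sampling the \(\gamma\) negative vertices uniformly; the random generator and data layout are our own choices.
\begin{verbatim}
import numpy as np

def graphsage_loss(Z, edges, gamma, rng):
    """Monte Carlo estimate of the unsupervised negative sampling loss."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    loss = 0.0
    for u, v in edges:
        loss -= np.log(sigmoid(Z[v] @ Z[u]))                   # positive (adjacent) pair
        negatives = rng.integers(0, Z.shape[0], size=gamma)    # gamma uniform negatives
        loss -= np.log(sigmoid(-(Z[negatives] @ Z[u]))).sum()  # estimate of the expectation
    return loss

# Usage sketch: graphsage_loss(Z, graph_edges, gamma=5, rng=np.random.default_rng(0))
\end{verbatim}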
    293 
    294 One of the advantages of Graph\textsc{sage} compared to the approach of \textcite{gcn_spectral_semi} is that it is inductive, whereas the spectral \textsc{gcn} presented in Section~\ref{sec:graph:spectral gcn} is transductive.
     295 Indeed, in the spectral approach, the filter is trained for a specific eigenvector matrix \(\mtrx{U}\), which depends on the graph.
    296 If the graph changes, everything must be re-trained from scratch.
    297 In comparison, the parameters learned by Graph\textsc{sage} can be reused for a different graph without any problem.
    298 
    299 A limitation of Graph\textsc{sage} is that the contribution of each neighbor to the representation of a vertex \(v\) is either fixed at \(1\divslash (\gfdegree(v)+1)\) (with the mean aggregator) or not modeled explicitly.
     300 The same can be observed with the model of \textcite{gcn_spectral_semi}, where the representation of each neighbor \(u\) is nonparametrically weighted by \(1\divslash\sqrt{(\gfdegree(v)+1)(\gfdegree(u)+1)}\).
    301 
    302 In contrast, graph attention network (\textsc{gat}, \citex{graph_attention_network}) proposes to parametrize this weight with a model similar to the attention mechanism presented in Section~\ref{sec:context:attention}.
    303 The output is built using an attention-like%
    304 \sidenote{
    305 	\Textcite{graph_attention_network} actually propose to use multi-head attention (Section~\ref{sec:context:transformer attention}).
    306 	We describe their model with a single attention head for ease of notation.
    307 }
    308 convex combination of transformed neighbors' representations:
    309 \begin{equation*}
    310 	\vctr{h}^{(\ell+1)}_v \gets \sigmoid\left(\sum_{u\in\gfneighbors(v)\cup\{v\}} \hspace{-3mm} \alpha^{(\ell)}_{vu} \mtrx{W}^{(\ell)} \vctr{h}_u^{(\ell)}\right),
    311 \end{equation*}
    312 where \(\alpha^{(\ell)}_{vu}\), the attention given by \(v\) to neighbor \(u\) at layer \(\ell\), is computed using a softmax:
    313 \begin{marginparagraph}
    314 	LeakyReLU \parencite{leakyrelu} is a variant of ReLU where the negative domain is linear with a small slope instead of being mapped to zero:
    315 	\begin{equation*}
    316 		\operatorname{LeakyReLU}(x) =
    317 		\begin{cases}
    318 			x & \text{if } x>0, \\
    319 			0.01x & \text{otherwise}. \\
    320 		\end{cases}
    321 	\end{equation*}
    322 \end{marginparagraph}
    323 \begin{equation*}
    324 	\alpha^{(\ell)}_{vu} \propto \exp \operatorname{LeakyReLU}\left(\vctr{g}^{(\ell)\transposesym}
    325 	\begin{bmatrix}
    326 		\mtrx{W}^{(\ell)}_\textsc{gat} \vctr{h}_v \\
    327 		\mtrx{W}^{(\ell)}_\textsc{gat} \vctr{h}_u
    328 	\end{bmatrix} \right).
    329 \end{equation*}
     330 As usual, the matrices \(\mtrx{W}\) are parameters, as is the vector \(\vctr{g}\), which is used to combine the representations of the two vertices into a scalar weight.
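A single attention head can be sketched as follows; apart from the equations above, the names and the dense data layout are our own simplifications.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gat_layer(adjacency, H, W, W_gat, g, slope=0.01):
    """Single-head graph attention layer (dense sketch, self-loop included)."""
    def leaky_relu(x):
        return np.where(x > 0, x, slope * x)

    H_next = np.empty((H.shape[0], W.shape[0]))
    for v, neighbors in adjacency.items():
        nbhd = neighbors + [v]
        scores = np.array([leaky_relu(g @ np.concatenate([W_gat @ H[v], W_gat @ H[u]]))
                           for u in nbhd])
        alpha = np.exp(scores - scores.max())        # softmax over the neighborhood
        alpha /= alpha.sum()
        H_next[v] = sigmoid(sum(a * (W @ H[u]) for a, u in zip(alpha, nbhd)))
    return H_next
\end{verbatim}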
    331 
    332 While \textsc{gat} and Graph\textsc{sage} can be trained in an unsupervised fashion following Equation~\ref{eq:graph:graphsage loss}, they can also be used as building blocks for larger models, similarly to how we use \textsc{cnn} in Chapter~\ref{chap:fitb}.
     333 Combined with their simpler theoretical background and ease of implementation, this flexibility has made spatial methods ubiquitous in graph-based approaches over the last few years.
    334 
    335 \subsection{\textsc{gcn} on Relation Graphs}
    336 \label{sec:graph:r-gcn}
    337 All the work introduced in the above sections is about simple undirected graphs \(G=(V, E)\).
    338 In contrast, in Section~\ref{sec:graph:encoding}, we encoded the relation extraction problem on attributed multigraphs \(G=(\entitySet, \arcSet, \gfendpoints, \gfrelation)\).
    339 Some works propose to extend \textsc{gcn} to the case of multigraphs, especially when dealing with knowledge bases.%
    340 \sidenote{
    341 	In this case, the multigraph is simply labeled since the set of relations is finite.
    342 	In contrast, in the relation extraction problem, the multigraph is attributed.
    343 	The arcs are associated with a sentence from an infinite set of possible sentences.
    344 }
    345 This is the case of \textsc{r-gcn} \parencitex{rgcn}, a graph convolutional network for relational data.
     346 The input graph is not labeled with sentences (\(\gfsentence\)) since \textsc{r-gcn} intends to model a knowledge base \(\kbSet\).
    347 This means that while \(G\) is a multigraph, the subgraphs \(G_\gfsr\) are simple graphs for all relations \(r\in\relationSet\).
    348 \textsc{r-gcn}s exploit this by using a separate \textsc{gcn} filter for each relation.
    349 An \textsc{r-gcn} layer can be defined as:
    350 \begin{marginparagraph}
    351 	Note that only the outgoing neighbors \(\gfneighborsrr\) are taken since for each incoming neighbor labeled \(r\), there is an outgoing one labeled \(\breve{r}\).
    352 \end{marginparagraph}
    353 \begin{equation}
    354 	\vctr{h}^{(\ell+1)}_v \gets \sigmoid\left(\mtrx{W}_0^{(\ell)} \vctr{h}_v^{(\ell)} + \sum_{r\in\relationSet}\sum_{u\in\gfneighborsrr(v)} \mtrx{W}_r^{(\ell)} \vctr{h}_u^{(\ell)} \right),
    355 	\label{eq:graph:r-gcn layer}
    356 \end{equation}
    357 \begin{marginparagraph}
    358 	Paralleling the notations used for \textsc{cnn}s in Section~\ref{sec:context:cnn}, we use \(d\) to denote the dimension of embeddings at layer \(\ell\) and \(d'\) for the dimension at layer \(\ell+1\).
    359 	More often than not, the same dimension is used at all layers \(d'=d\).
    360 	In the following, we use \(d\) as a generic notation for embedding and latent dimensions.
    361 \end{marginparagraph}
    362 where \(\mtrx{W}_0\in\symbb{R}^{d'\times d}\) is used for the (implicit) self-loop, while \(|\relationSet|\) different filters \(\mtrx{W}_r\in\symbb{R}^{d'\times d}\) are used for capturing the arcs.
     363 With highly multi-relational data, the number of parameters grows rapidly since a full matrix needs to be estimated for every relation, even rare ones.
    364 To address this issue, \textcite{rgcn} propose to either constrain the matrices \(\mtrx{W}_r\) to be block-diagonal, or to decompose them on a small basis \(\tnsr{Z}^{(\ell)}\in\symbb{R}^{B\times d'\times d}\):
    365 \begin{equation*}
    366 	\mtrx{W}^{(\ell)}_r = \sum_{b=1}^{B} a_{rb}^{(\ell)} \mtrx{Z}_b^{(\ell)},
    367 \end{equation*}
    368 where \(B\) is the size of the basis and \(\vctr{a}_r\) are the parametric weights for the matrices \(\mtrx{W}_r\).
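Equation~\ref{eq:graph:r-gcn layer} together with this basis decomposition can be transcribed directly; in the sketch below, the per-relation neighbor dictionaries are an arbitrary data layout of ours, and no normalization of the sum over neighbors is applied.
\begin{verbatim}
import numpy as np

def rgcn_layer(out_neighbors, H, W0, A, Z):
    """R-GCN layer with basis decomposition W_r = sum_b a_rb Z_b.

    out_neighbors[r][v]: outgoing neighbors of v through relation r,
    A: (num_relations, B) coefficients, Z: (B, d_out, d_in) basis."""
    W = np.einsum('rb,bij->rij', A, Z)       # one reconstructed filter per relation
    H_next = H @ W0.T                        # self-loop term W_0 h_v
    for r, neighbors_of in enumerate(out_neighbors):
        for v, neighbors in neighbors_of.items():
            for u in neighbors:
                H_next[v] += W[r] @ H[u]     # message from u through relation r
    return 1.0 / (1.0 + np.exp(-H_next))     # sigmoid, as in the equation above
\end{verbatim}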
    369 
    370 \Textcite{rgcn} evaluate their model on two tasks.
    371 First, they evaluate on an entity classification task using a simple softmax layer with a cross-entropy loss on top of the vertex representation at the last layer (\(\mtrx{H}^{(L)}\) as defined by Equation~\ref{eq:graph:r-gcn layer}).
    372 Second, more closely related to relation extraction, they evaluate on a relation prediction task.
    373 \begin{marginparagraph}
    374 	This is similar to the evaluation of TransE reported in Section~\ref{sec:context:transe}; except that instead of predicting a missing entity in a tuple \((e_1, r, e_2)\in\kbSet\), the model must predict the missing relation, assuming \hypothesis{1-adjacency} in the process.
    375 \end{marginparagraph}
     376 Given a pair of entities \((e_1, e_2)\in\entitySet^2\), the model must predict the relation \(r\in\relationSet\) between them, such that \((e_1, r, e_2)\in\kbSet\).
     377 To this end, \textcite{rgcn} employ the DistMult model, which can be seen as a \textsc{rescal} model (Section~\ref{sec:context:rescal}) where the interaction matrices are diagonal.
    378 The energy of a fact is defined as:
    379 \begin{equation*}
    380 	\psi_\text{DistMult}(e_1, r, e_2) = \vctr{u}_{e_1}\transpose \mtrx{C}_r \vctr{u}_{e_2},
    381 \end{equation*}
     382 where \(\vctr{u}_e\) is the embedding of the entity at the last layer of the \textsc{r-gcn}, \(\vctr{u}_e=\vctr{h}^{(L)}_e\), and \(\mtrx{C}_r\in\diagonal(\symbb{R}^d)\) is a diagonal parameter matrix.
     383 The probability associated with a fact by DistMult is proportional to the exponential of the energy function \(\psi_\text{DistMult}\).
    384 Therefore, a missing relation between \(e_1, e_2\in\entitySet\) can be predicted by taking the softmax over relations \(r\in\relationSet\) of \(\psi_\text{DistMult}(e_1, r, e_2)\).
    385 \textsc{r-gcn}s are trained using negative sampling (Section~\ref{sec:context:negative sampling}) on the entity classification and relation prediction tasks.
    386 This is similar to the training of TransE, where the main difference is that the entity embeddings are computed using \textsc{r-gcn} layers instead of being directly fetched from an entity embedding matrix.
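The relation prediction step itself is straightforward; in the sketch below, storing the diagonals of the \(\mtrx{C}_r\) as rows of a single matrix is merely an implementation convenience.
\begin{verbatim}
import numpy as np

def distmult_relation_probabilities(u_e1, u_e2, C):
    """Softmax over relations of psi(e1, r, e2) = u_e1^T diag(c_r) u_e2.

    C[r] holds the diagonal of the interaction matrix of relation r."""
    energies = C @ (u_e1 * u_e2)          # one energy per relation
    energies -= energies.max()            # numerical stability
    exp = np.exp(energies)
    return exp / exp.sum()
\end{verbatim}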
    387 
     388 A limitation of \textsc{r-gcn}s is that they rely only on vertex representations.
    389 Even when the evaluation involves the classification of arcs (as is the case with relation prediction), this is only done by combining the representations of the endpoints (using DistMult).
    390 
    391 Several works build upon \textsc{r-gcn}.
    392 \textsc{gp-gnn} \parencitex{gp-gnn} applies a similar model to the supervised relation extraction task.
    393 In this case, the graph is attributed with sentences instead of relations; therefore, the weight matrices \(\mtrx{W}_r\) are generated from the sentences instead of using an index of all possible relations.
    394 They apply their model to Wikipedia distantly supervised by Wikidata.
    395 However, the classification is still made from the representation of the endpoints of arcs.
    396 Related work also appears in the \emph{heterogeneous graph} community \parencitex{heterogeneous_attention, heterogeneous_transformer}.
    397 Heterogeneous graphs are graphs with labels on both vertices and arcs.
    398 The model proposed by \textcite{heterogeneous_transformer} is similar to \textsc{r-gcn} with an attention mechanism more akin to the transformer's attention (Section~\ref{sec:context:transformer attention}) than classical attention (Section~\ref{sec:context:attention}).
    399 The canonical evaluation datasets of this community are citation graphs.
    400 Vertices are assigned labels such as ``people,'' ``article'' and ``conference,'' while arcs are labeled with a small number of domain-specific relations: \textsl{author}, \textsl{published at}, \textsl{cite}, etc.
    401 The evaluation task typically corresponds to entity prediction.
    402 
    403 \subsection{Weisfeiler--Leman Isomorphism Test}
    404 \label{sec:graph:weisfeiler-leman}
    405 \begin{marginfigure}[-18mm]
    406 	\centering
    407 	\input{mainmatter/graph/isomorphism.tex}
    408 	\scaption[Example of isomorphic graphs.]{
    409 		Example of isomorphic graphs.
    410 		Each vertex \(i\) in the first graph corresponds to the \(i\)-th letter of the alphabet in the second graph.
     411 		Alternatively, these graphs have nontrivial automorphisms, for example mapping vertex \(i\) to vertex \(9-i\).
    412 		\label{fig:graph:isomorphism}
    413 	}
    414 \end{marginfigure}
    415 
    416 In this section, we introduce the theoretical background of \textsc{gcn}s.
     417 This is of particular interest to us since this theoretical background is more closely related to unsupervised relation extraction than \textsc{gcn}s might appear to be at first glance.
    418 As stated in the introduction to the thesis, relations emerge from repetitions.
    419 In particular, we expect that two identical (sub-)graphs convey the same relations.
    420 However, testing whether two graphs are identical is a complex problem.
    421 Indeed, we have to match each of the \(n\) vertices of the first graph to one of the \(n\) possibilities in the second graph.
    422 Naively, we need to try all \(n!\) possibilities.
    423 This is known as the graph isomorphism problem.
    424 Two simple graphs \(G_1 = (V_1, E_1)\), \(G_2 = (V_2, E_2)\) are said to be isomorphic (\(G_1\simeq G_2\)) iff there exists a bijection \(f\colon V_1\to V_2\) such that \((u, v)\in E_1 \iff (f(u), f(v))\in E_2\).
    425 Figure~\ref{fig:graph:isomorphism} gives an example of two isomorphic graphs.
    426 
    427 The various \textsc{gcn} methods introduced thus far can be seen as generalizations of the Weisfeiler--Leman%
    428 \sidenote[][-1cm]{
    429 	Often spelled Weisfeiler--Lehman, \textcite{leman_spelling} indicates that Andreĭ Leman preferred to transliterate his name without an ``h.''
    430 }
    431 isomorphism test \parencitex{weisfeiler-leman}, which tests whether two graphs are isomorphic.
    432 The \(k\)-dimensional Weisfeiler--Leman isomorphism test (\(k\)-dim \textsc{wl}) is a polynomial-time algorithm assigning a color to each \(k\)-tuple of vertices%
    433 \sidenote{An ordered sequence of \(k\) vertices, that is an element of \(V^k\), not necessarily connected.}
    434 such that two isomorphic graphs have the same coloring.
    435 With a bit of work, the general \(k\)-dim \textsc{wl} algorithm can be implemented in \(O(k^2 n^{k+1}\log n)\) \parencite{weisfeiler-leman_complexity}.
     436 However, there exist pairs of graphs that are not isomorphic, yet are assigned the same coloring by the Weisfeiler--Leman test \parencitex{weisfeiler-leman_fail}[-2.5mm].
     437 At the time of writing, the precise position of the graph isomorphism problem within the polynomial complexity classes is still unknown.
     438 Neither a polynomial-time algorithm nor a reduction from an \textsc{np}-complete problem is known.
    439 This makes graph isomorphism one of the prime candidates for the \textsc{np}-intermediate complexity class.%
    440 \sidenote{
    441 	The class of \textsc{np} problems neither in \textsc{p} nor \textsc{np}-complete.
    442 	It is guaranteed to be non-empty if \(\textsc{p}\neq\textsc{np}\).
    443 	Clues for the \textsc{np}-intermediateness of the graph isomorphism problem can be found in the fact that the counting problem is in \textsc{np} \parencite{gi_counting} and more recently, from the fact that a quasi-polynomial algorithm exists \parencite{gi_quasipoly}.
    444 }
    445 
    446 \begin{algorithm}
    447 	\centering
    448 	\begin{minipage}[b]{8cm}
    449 		\input{mainmatter/graph/Weisfeiler-Leman.tex}
    450 	\end{minipage}
    451 	\scaption[The Weisfeiler--Leman isomorphism test.]{
    452 		The Weisfeiler--Leman isomorphism test.
    453 		The double braces \(\lMultiBrace\ \rMultiBrace\) denote a multiset.
    454 		Since \(\symfrak{I}_\ell\) is indexed with the previous coloring \(\chi_{\ell-1}(\vctr{x})\) of the vertices---alongside \(c_\ell(x)\)---the number of color classes is strictly increasing until the last iteration when it remains constant.
    455 		Since the last coloring is stable, we refer to it as \(\chi_\infty\).
    456 		\label{alg:graph:weisfeiler-leman}
    457 	}
    458 \end{algorithm}
    459 
    460 The general \(k\)-dim \textsc{wl} test is detailed in Algorithm~\ref{alg:graph:weisfeiler-leman}.
    461 It is a refinement algorithm, which means that at a given iteration, color classes can be split, but two \(k\)-tuples with different colors at iteration \(\ell\) can't have the same color at iteration \(\ell'>\ell\).
     462 Initially, all \(k\)-tuples \(\vctr{x}\) are assigned a color according to their isomorphism class \(\operatorname{iso}(\vctr{x})\).
     463 We define the isomorphism class through an equivalence relation.
     464 For two \(k\)-tuples \(\vctr{x}, \vctr{y}\in V^k\), \(\operatorname{iso}(\vctr{x})=\operatorname{iso}(\vctr{y})\) iff:%
    465 \sidenote{
    466 	To avoid having to align two colorings, the \textsc{wl} algorithm is usually run on the disjoint union of the two graphs.
    467 	So, strictly speaking, it tests for automorphism (isomorphic endomorphism).
    468 	Therefore we can assume \(\vctr{x}\) and \(\vctr{y}\) are from the same vertex set \(V\).
    469 }
    470 \begin{itemize}
    471 	\item \(\forall i, j \in [1, \dotsc, k]: x_i=x_j \iff y_i=y_j\)
    472 	\item \(\forall i, j \in [1, \dotsc, k]: (x_i, x_j)\in E \iff (y_i, y_j)\in E\)
    473 \end{itemize}
    474 Intuitively, this checks whether \(x_i \mapsto y_i\) is an isomorphism for the subgraphs built from the \(k\) vertices \(\vctr{x}\) and \(\vctr{y}\).
     475 This is not the same as the graph isomorphism problem since here the candidate isomorphism is given; we don't have to test the \(k!\) possibilities.
    476 
    477 The coloring of \(\vctr{x}\in V^k\) is refined at each step by juxtaposing it with the coloring of its neighbors \(\gfneighbors^k(\vctr{x})\).
    478 We need to reindex the new colors at each step since the length of the color strings would grow exponentially otherwise.
    479 The set of neighbors%
    480 \sidenote{
    481 	Note that the kind of neighborhood defined by \(\gfneighbors^k\) completely disregards the edges in the graph.
    482 	For this reason, it is sometimes called the \emph{global neighborhood}.
    483 }
    484 of a \(k\)-tuple for \(k\geq 2\) is defined as:
    485 \begin{equation*}
    486 	\gfneighbors^k(\vctr{x}) = \left\{\, \vctr{y}\in V^k \middlerel{|} \exists i\in[1,\dotsc,k]: \forall j\in[1,\dotsc,k]: j\neq i \implies x_j=y_j \,\right\}.
    487 \end{equation*}
    488 In other words, the \(k\)-tuples \(\vctr{y}\) neighboring \(\vctr{x}\) are those differing by at most one vertex with \(\vctr{x}\).
    489 
    490 The 1-dim \textsc{wl} test is also called the \emph{color refinement} algorithm.
    491 In this case, \(\gfneighbors^1(x)\) is simply \(\gfneighbors(x)\) the set of neighbors of \(x\).
    492 The isomorphism class of a single vertex is always the same, so \(\chi_0\) assigns the same color to all vertices.
    493 The first iteration of the algorithm groups vertices according to their degree (the multiplicity of the sole element in the multiset \(c_1(x)\)).
     494 The second iteration \(\chi_2\) then colors each vertex according to its degree \(\chi_1\) and the degrees of its neighbors \(c_2\).
     495 And so on, until the coloring \(\chi\) no longer changes.
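For the 1-dimensional case, the whole procedure fits in a few lines; as noted above, comparing two graphs amounts to running it on their disjoint union, and the concrete reindexing of signatures below is only one possible realization of \(\symfrak{I}_\ell\).
\begin{verbatim}
def color_refinement(adjacency):
    """1-dim Weisfeiler-Leman: refine vertex colors until the partition is stable."""
    colors = {v: 0 for v in adjacency}                      # chi_0: a single color class
    while True:
        # c_l(v): the multiset of the neighbors' colors, paired with v's own color
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
            for v in adjacency
        }
        # Reindex the signatures into small integers (the role of the index I_l)
        index = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: index[signatures[v]] for v in adjacency}
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors                               # stable coloring chi_infinity
        colors = new_colors
\end{verbatim}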
    496 
     497 The \textsc{gcn}s introduced in the previous sections can be seen as variants of the 1-dim \textsc{wl} algorithm where the index \(\symfrak{I}_\ell\) is replaced with a neural network such as the \(\operatorname{aggregate}^{(\ell)}_\text{mean}\) operator given in Section~\ref{sec:graph:spatial gcn}.
     498 In this case, \(\chi_\ell\) corresponds to \(\mtrx{H}^{(\ell)}\), the activations at layer \(\ell\).